Abstract

We show in this paper that a similarity network can be used as a powerful tool to study
the relationships among tRNA genes. We constructed a network of 3719 tRNA gene sequences using
the simplest alignment and studied its topology, degree distribution and clustering coefficient.
The behavior of the network shifts from a fluctuating distribution to a scale-free distribution
as the similarity degree of the tRNA gene sequences increases. tRNA gene sequences with the
same anticodon identity are more self-organized than tRNA gene sequences with different
anticodon identities and form local clusters in the network. An interesting finding of our study
is that some vertices of a local cluster are highly connected with other local clusters; a probable
reason is given. Moreover, a network constructed from the same number of random tRNA sequences
is used for comparison. The relationships between the properties of the tRNA similarity network
and the characteristics of tRNA evolutionary history are discussed.
I. INTRODUCTION



Transfer ribonucleic acid, or tRNA for short, is an important molecule that transmits
genetic information from DNA to protein in molecular biology. It is known that all
tRNAs share common primary, secondary and tertiary structures. Most tRNA sequences carry
a "CCA" tail at the 3' terminus of their primary structure. Their
secondary structure is represented by a cloverleaf. They have four base-paired stems and a
variable stem, defining three stem loops (the D loop, anticodon loop, and T loop) and the
acceptor stem, to which oligonucleotides are added in the charging step [1]. The variable loop
varies in length from 4 to 13 nt; some of the longer variable loops contain base-paired stems.
The tRNAs also share a common three-dimensional shape, which resembles an inverted
"L". Although much effort has been devoted to tRNA research, little is known
about specific features of tRNA that are exclusive to a species, taxon or phylogenetic
domain [2]. With the progress of genome projects, a vast amount of tRNA nucleotide sequence
data is now available, which makes it possible to study tRNA gene expression for
a wide range of organisms.

Recently, scientists have been trying to find specific features in gene families with a new
tool: complex networks. With the development of oligonucleotide and cDNA array techniques,
using gene chips to build a complex network and studying its features and evolution has
become a hot subject and has met with success [3, 4, 5, 6]. Basically, networks can
be classified into two types according to the degree distribution $p(k)$ of their nodes:
exponential networks and scale-free networks. In the former type, although
not all nodes are connected to the same degree, most have
a number of connections hovering around a small average value, i.e. $k \approx \langle k \rangle$,
where $k$ is the number of edges connected to a node and is called the degree of the node. This
leads to a Poisson or exponential degree distribution, as in the random graph model [7] and the
small-world model [8]; such networks are also called homogeneous networks. In the latter type,
some nodes act as "very connected" hubs with very large numbers of connections,
while most nodes have only a few connections. The degree distribution is a power law,
$p(k) \sim k^{-\gamma}$. Such a network is called an inhomogeneous network, or scale-free network [9].

The tRNA sequences are similar in sequence and structure, which makes it possible
to construct networks and to use specialized clustering techniques for classification. The
similarity of tRNA sequences suggests their relationships in evolutionary history. If we
assume that all present-day tRNA sequences evolved from a common ancestor via mutation,
sequence similarity will reveal their evolutionary affiliation. There are many tRNA
sequences, and the similarity differs for every pair of them, so a large amount of data has
to be handled. Since a complex network is a good model for describing and studying complex
relationships, the network model may be useful in this field. In this paper, we construct a
similarity network of 3719 tRNA genes in order to show that the network model is a powerful
tool for studying the evolutionary relationships among tRNA genes. The topology of the network is
discussed, the degree distribution and clustering coefficient are considered, and a network
constructed from the same number of random tRNA sequences is used for comparison.



II. MATERIALS AND METHODS 



A. tRNA sequences 

Transfer RNA sequences have been collected into a database by Sprinzl et al. [10] since 1974.
All of our data, 3719 tRNA gene sequences, were retrieved from this database (freely
available at http://www.uni-bayreuth.de/departaments/biochemie/sprinzl/trna/), which
includes 61 anticodon subsets, 429 species, and 3 kingdoms: Archaea, Bacteria, and Eucarya.
Each tRNA sequence has 99 bases when the variable stem is included. For convenience
of alignment, positions where a base is absent in a tRNA are filled with a "blank".
First, we align the tRNA sequences with the same anticodon, and then we align all 3719
tRNAs. Since many studies have already shown that tRNA genes have a high
sequence similarity [11, 12, 10], the alignment results for the 3719 tRNA gene sequences
are not listed in detail. We focus only on some prominent statistical features of
the alignment.



B. tRNA sequences network 

If each base, including the inserted "blank", is weighted equally, the length of a tRNA
is L = 99. To compare two aligned tRNA sequences, a parameter s is used to describe their
similarity degree: it counts the positions at which the two tRNA gene sequences carry
identical bases. For example, if the first bases of two tRNA sequences are both A, one is
added to s. Obviously, $0 \le s \le 99$. Although this is the simplest kind of alignment, as we
show later, it gives a lot of information about the relationships among tRNA genes. When s = 99,
the two sequences match perfectly. Since perfectly matched sequences have the
same biological significance, we keep only one of them as a representative. To construct
the tRNA similarity network, every sequence is considered as a node. If the alignment score
s of two tRNA sequences is larger than a given similarity degree $s_0$, an edge is placed between
the corresponding nodes. Obviously, if $s_0$ is small, the nodes are densely connected, and as
$s_0$ grows, the number of connections decreases.
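
The construction described above can be illustrated with a minimal Python sketch. It is not the code used for this paper: the function names `similarity` and `build_network`, the use of "-" for the inserted blank, and the five-symbol toy sequences are assumptions made only for illustration. The sketch counts identical positions (a blank matching a blank counts like any other symbol), merges perfectly matched sequences into one node, and links two nodes when s is strictly larger than the threshold.

```python
from itertools import combinations


def similarity(a, b):
    """Number of aligned positions at which two sequences carry the same symbol."""
    if len(a) != len(b):
        raise ValueError("sequences must be padded to the same length")
    return sum(x == y for x, y in zip(a, b))


def build_network(sequences, s0):
    """Return (nodes, edges) of the similarity network at threshold s0."""
    # Perfectly matched sequences (s = L) collapse into one representative node.
    nodes = list(dict.fromkeys(sequences))
    edges = set()
    for (i, a), (j, b) in combinations(enumerate(nodes), 2):
        if similarity(a, b) > s0:  # strict inequality, following the text
            edges.add((i, j))
    return nodes, edges


# Toy data with L = 5 instead of 99; "-" stands for the inserted blank.
toy = ["ACGT-", "ACGTA", "ACGT-", "TTTTT"]
nodes, edges = build_network(toy, s0=3)
print(nodes)  # three distinct sequences remain
print(edges)  # only the first two are similar enough to be linked
```

The pairwise loop is quadratic in the number of sequences, which is still manageable for a few thousand sequences of length 99.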

For comparison, we build a similarity network from the same number of random tRNA
genes. To generate the random tRNA genes, every base of a sequence is drawn at random
from the four bases (C, G, A and T), subject to the constraint that the sequence conforms to
the prototype of real tRNA; that is, the randomly generated sequences must conform to the
secondary structure of tRNA.
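
One possible reading of this constraint is sketched below in Python. The text does not specify how conformity to the secondary structure is enforced, so everything here is an assumption: the function `random_trna_like`, the list of paired positions, and the restriction to Watson-Crick complements are hypothetical choices for illustration only. Unpaired positions are drawn uniformly from the four bases, and each stem partner is set to the complement of its mate.

```python
import random

# Watson-Crick complements in the DNA alphabet used for the gene sequences.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def random_trna_like(length, pairs, rng):
    """Draw a random sequence whose listed position pairs are complementary.

    `pairs` is a list of (i, j) index pairs that are base-paired in the
    assumed secondary-structure template (hypothetical input).
    """
    seq = [rng.choice("ACGT") for _ in range(length)]
    for i, j in pairs:
        seq[j] = COMPLEMENT[seq[i]]  # force the stem partner to pair
    return "".join(seq)


# Toy template: a 12-base hairpin with a three-pair stem (not a real tRNA).
toy_pairs = [(0, 11), (1, 10), (2, 9)]
rng = random.Random(0)
sample = [random_trna_like(12, toy_pairs, rng) for _ in range(5)]
```

A real template would list the paired positions of the four cloverleaf stems within the 99-position alignment.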



III. GRAPH TOOL 



Pajek (the Slovene word for spider), a program for large-network analysis [13] (freely
available at http://vlado.fmf.uni-lj.si/pub/networks/pajek/), was used to map the topology of
the network.



IV. RESULTS 



A. The topology of network 

Figure 1 displays several typical topologies of the similarity network for different kinds
of tRNA gene sequences. Figures 1(a), (b) and (c) are similarity networks constructed from
tRNA genes with the same anticodon (CGC, CCA and TGC, respectively) at $s_0 = 60$.
The networks of tRNA genes with the same anticodon identity are highly clustered. Some
of them divide into two or more clusters, as in figure 1(c). Each cluster is almost
entirely connected when $s_0$ is small. As $s_0$ grows, the number of connections decreases
and the network becomes less densely connected. Figure 1(d) is the similarity network of
anticodon GTT at $s_0 = 80$. As more nodes are added, the network becomes
more complex. Figures 1(e) and (f) show networks with a large N (the number of nodes):
(e) is the network containing anticodons CAT and GCC at $s_0 = 80$, and (f) is the network
of all 3420 tRNA sequences at $s_0 = 90$. Small local clusters with the same anticodon join
to form a large cluster, and "very connected" hubs can be observed in the center of the
network (figure 1(f)). At a large similarity degree, the scale-free property (a power-law
degree distribution) emerges: a few nodes have a large degree (number of connections),
but most nodes have a small degree. To make figure 1(e) easier to read, we extracted
the nodes with more than 25 connections to produce figure 2. It also has hubs
in the center of the network. Of course, the hubs are smaller. The scale-free property is still
preserved.

The distribution of the connected probability of the networks of tRNA genes with the
same anticodon is shown in Table I. The connected probability is defined as the ratio of the
number of actual connections to the largest possible number of connections. The table shows
that when $s_0 = 50$ the networks are almost entirely connected and most of the connected
probabilities are larger than 0.8; when $s_0 = 90$, most of the connected probabilities decrease
to one tenth of their former values, and some decrease to zero.
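
As a worked illustration of this definition, the following Python sketch computes the connected probability and the mean degree from the number of nodes N and the number of edges. The function names are assumptions introduced here; the inputs in the usage lines are the values reported in figure 1(a) and in Table II.

```python
def connection_probability(n_nodes, n_edges):
    """Fraction of actual edges among the N(N-1)/2 possible ones."""
    max_edges = n_nodes * (n_nodes - 1) // 2
    if max_edges == 0:
        return 0.0  # a network with fewer than two nodes has no possible edge
    return n_edges / max_edges


def mean_degree(n_nodes, n_edges):
    """Each edge contributes to the degree of two nodes."""
    return 2 * n_edges / n_nodes


# A complete network on N = 6 nodes has 15 edges, so P = 1.0.
p_small = connection_probability(6, 15)

# Whole real network at similarity degree 50: N = 3420 with 3434403 edges.
p_real = connection_probability(3420, 3434403)
k_real = mean_degree(3420, 3434403)
k_random = mean_degree(3420, 4321688)
```

With the edge counts of Table II, the mean degrees obtained this way are about 2008 for the real network and about 2527 for the random one.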

Consider the network of random tRNA sequences of the same size. When the similarity
degree $s_0$ is small, most of the nodes have the same number of connections. As $s_0$
increases, the number of edges in the network decreases sharply and most of the nodes
lose their links; only a few of them keep two or three edges. Table II shows the numbers
of connections of the real tRNA similarity network and the random tRNA similarity
network at different similarity degrees. When $s_0 = 50$, the numbers
of connections in both networks are very large; by $s_0 = 90$ both have
dropped, but the random network drops more quickly than the real one. The number of
connections of the real tRNA network, $n_{real}$, drops from 3434403 ($s_0 = 50$) to 3429
($s_0 = 90$). The number of connections of the random tRNA network, $n_{random}$, drops from
4321688 ($s_0 = 50$) to 0 at $s_0 = 80$.
This shows that the real tRNA sequences are more similar to each other than random ones
are. In other words, the real tRNA sequences are not random. If we consider that
the real tRNA genes have evolutionary relationships, the differences between the statistics
of the real and random tRNA similarity networks shown above can be explained to a certain
extent.
B. Degree distributions



It has already been found that networks constructed from the large-scale organization of
genomic sequence segments display a transition from a Gaussian distribution via a truncated
power law to a real power-law-shaped connectivity distribution with increasing segment
size [14]. The similarity networks of tRNA sequences have similar features. The investigation
begins with an important parameter, the degree distribution p(k) of the nodes; the
analysis is shown in figure 3.

As observed in Figure 3, as the similarity degree $s_0$ increases, p(k) behaves more and
more like a power-law distribution. When $s_0 = 50$, the degree distribution p(k) of the nodes
is a continuous, fluctuating distribution. For k < 1088, Np(k) fluctuates between 1
and 3; for k > 1800, Np(k) fluctuates between 1 and 9, and the peak of the fluctuation
is at k = 2600. The mean degree is $\langle k \rangle = 2008$, and the maximal degree is
$k_{\max} = 3052$. When $s_0 = 60$, the peak of the fluctuation shifts to the left, to k = 100.
When $s_0 = 70$, p(k) resembles a power-law distribution if the minimal value
of k is ignored. For $s_0 > 70$, the distribution changes from an approximate power law to a true
power law. As shown in figure 3(e), when $s_0 = 90$, the distribution curve fits a power law
very well. The fitted curve is $p(k) = 0.192\,k^{-1.036} - 0.006$.
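
For readers who wish to reproduce this kind of analysis, the Python sketch below tabulates p(k) from an edge list and estimates a power-law exponent. It is an illustration under stated assumptions, not the fitting procedure used for figure 3: the fit reported above includes a constant offset, whereas this sketch swaps in a plain least-squares fit of log p(k) against log k, which models only $p(k) = a\,k^{-\gamma}$. The function names are hypothetical.

```python
import math
from collections import Counter


def degree_distribution(n_nodes, edges):
    """Return {k: p(k)}, the fraction of nodes having degree k."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree[i] for i in range(n_nodes))
    return {k: c / n_nodes for k, c in sorted(counts.items())}


def fit_power_law(pk):
    """Least-squares fit of log p = log a - gamma * log k; returns (a, gamma)."""
    points = [(math.log(k), math.log(p)) for k, p in pk.items() if k > 0 and p > 0]
    if len(points) < 2:
        raise ValueError("need at least two points with k > 0 and p(k) > 0")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return math.exp(mean_y - slope * mean_x), -slope


# Degree distribution of a small star: one hub linked to three leaves.
star = degree_distribution(4, [(0, 1), (0, 2), (0, 3)])

# Synthetic check: exact power-law data should return its own parameters.
synthetic = {k: 0.192 * k ** -1.036 for k in range(1, 60)}
a, gamma = fit_power_law(synthetic)
```

Log-log regression is sensitive to noise in the tail of the distribution, so it should be read as a rough estimate of the exponent.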

Compared with the real tRNA gene sequences, the degree distribution of the network of
random tRNA sequences at $s_0 = 50$ is a Gaussian distribution (figure 3(f)). Most nodes
have approximately the same degree, $k \approx \langle k \rangle = 2527$; the maximal degree is
$k_{\max} = 2895$ and the minimal degree is $k_{\min} = 2327$. When $s_0 = 60$, the distribution
is almost unchanged (figure 3(g)). When $s_0 = 70$, the number of edges falls sharply, and the
maximal degree is k = 5. In figures 3(f) and (g), there are lower peaks besides the main peak
of the Gaussian distribution. This is possibly because the random tRNA sequences are not generated
completely arbitrarily, since they must conform to the prototype of real tRNA.

From the above analysis, we can conclude that the real tRNA genes are more self-organized
than the random tRNA genes. The power-law distribution means that a few tRNA
genes behave as "very connected" hubs of the similarity network; many tRNA
genes are similar to them in sequence. If we suppose that all tRNA genes
come from a common ancestor, it is possible that the "very connected" tRNA genes are
more closely related to the ancestor than other tRNA genes are. In other words, the "very
connected" tRNA genes have probably diverged less from the ancestral sequences than other tRNA
genes during evolutionary history. In mathematics, one way to construct a scale-free network
is to follow the rule that a newly added node is much more likely to connect to a node with
a large degree than to a node with a small degree [9]. In the tRNA similarity
network, this may mean that the tRNA genes with small degrees have diverged more from
the ancestral sequences and are less stable than the tRNA genes with large degrees.



C. Clustering coefficient 

If a node connects with i other nodes and there are j edges among these i nodes,
the clustering coefficient of the original node is defined as

$$ c = \frac{2j}{i(i-1)}, $$

where $i(i-1)/2$ is the total number of possible connections among i nodes. The clustering
coefficient reflects the relationships among the neighbors of a node and quantifies the inherent
tendency of the network to cluster. As shown in Figure 4, the average clustering coefficient
$c_{real}$ of the real tRNA network is larger than that of the random one. As $s_0$ increases,
$c_{real}$ decreases. At $s_0 = 60$ it reaches a local minimum, then increases slightly and
decreases slowly again. In comparison, the
average clustering coefficient $c_{random}$ of the random network decreases quickly as $s_0$
increases; when $s_0 > 70$, $c_{random} \approx 0$. The behavior of the coefficients of the two
networks is also listed in Table II. When $s_0 = 50$, $c_{real} = 0.777367$ and
$c_{random} = 0.747479$; when $s_0 > 70$, $c_{random}$ drops
to zero quickly, whereas $c_{real}$ decreases slowly. Once again, this shows that the real tRNA
genes are not randomly selected. The real tRNA genes have close relationships with each other.
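
The definition above translates directly into a short Python sketch. The function name is an assumption, and so is the convention of assigning c = 0 to nodes with fewer than two neighbors, a case the definition leaves open.

```python
from itertools import combinations


def clustering_coefficients(n_nodes, edges):
    """Per-node clustering coefficient c = 2j / (i(i-1))."""
    neighbors = [set() for _ in range(n_nodes)]
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    coefficients = []
    for node in range(n_nodes):
        i = len(neighbors[node])
        if i < 2:
            coefficients.append(0.0)  # assumed convention for degree 0 or 1
            continue
        # j counts the edges that exist among the i neighbors of this node.
        j = sum(1 for a, b in combinations(neighbors[node], 2) if b in neighbors[a])
        coefficients.append(2 * j / (i * (i - 1)))
    return coefficients


# Toy network: a triangle (0, 1, 2) with one pendant node attached to node 2.
toy_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
c = clustering_coefficients(4, toy_edges)
average_c = sum(c) / len(c)
```

In the toy network, nodes 0 and 1 have c = 1, node 2 has c = 1/3 because only one of its three neighbor pairs is linked, and the pendant node has c = 0.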

Table III shows the average clustering coefficients of 19 tRNA groups,
classified by the amino acid they can accept. Some groups contain isoacceptor
tRNAs, that is, different tRNA species that bind to alternate codons for the same
amino acid residue. The tRNA group that carries the amino acid Met is
ignored because it contains only one tRNA sequence. Comparing Table III with Table II, we can
conclude that nodes are more likely to connect with nodes within the same amino
acid group. The tRNA similarity network can be divided into several large clusters, each with
the same amino acid. This hints that an evolutionary step in tRNA genes is much more likely to
happen within the same amino acid group. Cases in which a tRNA gene of one amino
acid evolves into a tRNA gene of another amino acid are rare.



V. DISCUSSION 



In this paper, we aim to show that the network model is a powerful tool to study the
relationships among tRNA genes. Although some results are not new, such as that real tRNA genes
are not random and that the relationships among tRNA genes with the same anticodon are closer
than those among tRNA genes with different anticodons, they are evidence that the network
model works well, because it distinguishes these properties clearly. What
is more, the tRNA similarity network exhibits scale-free properties when $s_0$ is large. As is
known, the scale-free nature is rooted in two generic mechanisms [9]. First, scale-free networks
describe open systems that grow by the continuous addition of new nodes. Second, scale-free
networks exhibit preferential attachment, meaning that the likelihood of connecting to a
node depends on the node's degree. With these mechanisms, the "very connected" nodes in
scale-free networks are usually those added at an early time during the growth of the
network. It has been found that most recent tRNA genes are evolved from a few common 
p_sQQ, and these o.dest evoMiooary _ coding to the recent tRNA 
genes. Therefore, in the tRNA similarity network, the "very connected" tRNA genes may have
diverged less from their ancestors than weakly connected ones.

Most recently, many research conclusions show that genes of related function could behave 
together as a group in the networks constructed according to their similarity features 0, 
0). In this paper, although we use the simplest alignment, this property can be found. 
When the similarity degree $s_0$ is small, the nodes of tRNA genes with the same anticodon
connect to form a local cluster whose members are entirely connected. When $s_0$ increases
to a large value, a scale-free character emerges: a few nodes compose the core of the
network and most nodes have few links. These observations seem to fit
the evolutionary processes of the tRNA genes well. On the other hand, the oldest tRNA genes
undergo disturbances such as mutation, loss, insertion, or rearrangement during
evolution. Some new tRNA genes are suited to the environment and are preserved, so they
have a high similarity to their ancestral sequences. In the network constructed from the
similarity degree of these tRNA genes, they form local clusters.
An interesting finding in the tRNA similarity networks is that some local clusters have high
connectivity with other clusters; that is, some nodes of one cluster have many
connections with some nodes of another cluster (see figure 5). This may hint at an evolutionary
relationship between tRNA sequences of two different anticodons. Figure 5(a) shows the
network of two different anticodons, ACG and CCA. The solid circles are the tRNA
genes of ACG, and the hollow circles are the tRNA genes of CCA. In this figure, they
mix into one cluster. Figure 5(b) shows the network of anticodons TAG and TGA. The
solid circles are the tRNA genes of TAG, and the hollow circles are the tRNA genes
of TGA. They form three clusters in the topology map, and each cluster has some nodes
that are highly connected with some nodes of other clusters. This shows that although some tRNA
genes have different anticodons, their sequences are highly similar. In evolutionary
history, the tRNA genes of one anticodon identity can evolve into tRNA genes of another
identity. The above finding may be evidence of this evolutionary mode. On the
other hand, in figure 1(c), the network of the same anticodon GCC splits into two clusters.
This hints that the evolution of tRNA genes with the same anticodon may have diverged in the
course of history. Therefore, there are different modes of evolutionary processes, i.e. evolution
within the same anticodon group and evolution among different anticodon groups. The former
may be the main part of tRNA evolution. The latter may be the key cases of interaction
among tRNAs of different anticodons during evolution.

Because the alignment we used simply counts the number of sites that are identical,
it loses much information about the evolutionary process. More sophisticated alignment models
may reveal more details of the relationships among tRNA genes. The content of the tRNA
database is limited, and the numbers of tRNA sequences from different organisms vary widely.
Therefore, taxon sampling bias may influence the topology of the network, and the
results obtained from the network may not completely reflect the evolutionary relationships of
tRNA genes. This limitation of the network model will be reduced as more tRNA
genes are sequenced. Although we did not obtain many results beyond what is already known
about the evolution of tRNA genes, the results serve as evidence that the network model
works well in research on the relationships among tRNA genes and is a useful tool.
TABLE I: The distribution of the connected probability of the tRNA networks of 57 anticodons;
four anticodons were excluded because they have too few vertices. The statistics show that when
$s_0 = 50$, many networks are completely connected; when $s_0 = 90$, the connected probability
decreases sharply, and some connected probabilities decrease to zero.
s_0    n_real     n_random    c_real      c_random

50     3434403    4321688     0.777367    0.747479
60     994571     367845      0.541708    0.139572
70     142264     773         0.578806    0.000682
80     19453      0           0.567380    0.000000
90     4249       0           0.286254    0.000000

TABLE II: The number of edges and the average clustering coefficients of the two networks at
different similarity degrees. The number of nodes is 3420. s_0: similarity degree; n: the number
of edges of the network; c: average clustering coefficient.
                                                        0.200635
HIS     0.93883     0.745659    0.716469    0.677458    0.406243
LEU     0.921699    0.681914    0.733301    0.609716    0.313292
LYS     0.722222    0.666667    0.666667
PHE     0.877036    0.678511    0.744765    0.621813    0.287326
PRO     0.97332     0.848348    0.546192    0.447778    0.18366
SER     0.856441    0.666374    0.649365    0.542714    0.175925
STOP    0.758638    0.707049    0.731592    0.57147     0.217136
THR     0.93964     0.831404    0.6121      0.585127    0.309603
TRP     0.9265      0.789019    0.754142    0.581435    0.225642
TYR     0.932773    0.779707    0.684642    0.558649    0.306018
ARG     0.805026    0.834632    0.632049    0.375894    0.147186

TABLE III: The average clustering coefficients of the networks of 19 tRNA groups, classified by
the amino acid accepted. Each network is named by the three-letter abbreviation of its amino acid.
FIG. 1: The topology of the network. (a), (b), (c), (d) are the topologies of networks of tRNA
genes with the same anticodon. The three capital letters denote the anticodon subset of the tRNA
genes. (a): CGC, $s_0 = 60$, N = 6, P = 1.0; (b): CCA, $s_0 = 60$, N = 150, P = 0.8414; (c): TGC,
$s_0 = 60$, N = 215, P = 0.5892; (d): GTT, $s_0 = 80$, N = 145, P = 0.028. (e), (f) are the
topologies of networks of different anticodons. (e): $s_0 = 80$, N = 304, P = 0.04, network of
CAT and GCC; (f): $s_0 = 90$, N = 3420, P = 0.0034. $s_0$ is the similarity degree; N is the
number of nodes; P is the connection probability.
FIG. 3: (a), (b), (c), (d), (e) are the degree distributions of the tRNA gene sequence network,
N = 3420. The line in (e) is a power-law fit of the data, $p(k) = 0.192\,k^{-1.036} - 0.006$.
(f), (g) are the degree distributions of the random tRNA sequence network, N = 3420.
FIG. 4: The distribution of the clustering coefficient of the two networks as a function of their
similarity degree (horizontal axis: similarity degree).
FIG. 5: Clusters of the network of tRNA genes of different anticodons. They are segments of the
topology of the 3420 tRNA genes. (a) Comprising 96 vertices and 97 edges, at similarity degree
$s_0 = 60$, containing anticodons ACG (solid circles) and CCA (hollow circles); (b) comprising
226 vertices and 227 edges, at similarity degree $s_0 = 60$, containing anticodons TAG (solid
circles) and TGA (hollow circles).